The LIMSI Continuous Speech Dictation System

Authors

  • Jean-Luc Gauvain
  • Lori Lamel
  • Gilles Adda
  • Martine Adda-Decker
Abstract

A major axis of research at LIMSI is directed at multilingual, speaker-independent, large vocabulary speech dictation. In this paper the LIMSI recognizer which was evaluated in the ARPA NOV93 CSR test is described, and experimental results on the WSJ and BREF corpora under closely matched conditions are reported. For both corpora, word recognition experiments were carried out with vocabularies containing up to 20k words. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. The recognizer uses a time-synchronous graph-search strategy which is shown to still be viable with a 20k-word vocabulary when used with bigram back-off language models. A second forward pass, which makes use of a word graph generated with the bigram, incorporates a trigram language model. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra- and inter-word), phone duration models, and sex-dependent models.

INTRODUCTION

Speech recognition research at LIMSI aims to develop recognizers that are task-, speaker-, and vocabulary-independent so as to be easily adapted to a variety of applications. The applicability of speech recognition techniques used for one language to other languages is of particular importance in Europe. The multilingual aspects are in part carried out in the context of the LRE SQALE (Speech recognizer Quality Assessment for Linguistic Engineering) project, which is aimed at assessing language-dependent issues in multilingual recognizer evaluation. In this project, the same system will be evaluated on comparable tasks in different languages (English, French and German) to determine cross-lingual differences, and different recognizers will be compared on the same language to compare the advantages of different recognition strategies.
In this paper some of the primary issues in large vocabulary, speaker-independent, continuous speech recognition for dictation are addressed. These issues include language modeling, acoustic modeling, lexical representation, and search. Acoustic modeling makes use of continuous density HMM with Gaussian mixtures of context-dependent phone models. For language modeling, n-gram statistics are estimated on text material. To deal with phonological variability, alternate pronunciations are included in the lexicon, and optional phonological rules are applied during training and recognition. The recognizer uses a time-synchronous graph-search strategy [16] for a first pass with a bigram back-off language model (LM) [10]. A trigram LM is used in a second acoustic decoding pass which makes use of the word graph generated using the bigram LM [6]. Experimental results are reported on the ARPA Wall Street Journal (WSJ) [19] and BREF [14] corpora, using for both corpora over 37k utterances for acoustic training and more than 37 million words of newspaper text for language model training. While the number of speakers is larger for WSJ, the total amount of acoustic training material is about the same (see Table 1). It is shown that for both corpora increasing the amount of training utterances by an order of magnitude reduces the word error by about 30%. The use of a trigram LM in a second pass also gives an error reduction of 20% to 30%. The combined error reduction is on the order of 50%. (This work is partially funded by the LRE project 62-058 SQALE.)

LANGUAGE MODELING

Language modeling entails incorporating constraints on the allowable sequences of words which form a sentence. Statistical n-gram models attempt to capture the syntactic and semantic constraints by estimating the frequencies of sequences of n words. In this work, bigram and trigram language models are estimated on the training text material for each corpus.
This data consists of 37M words of the WSJ and 38M words of Le Monde. (While we have built n-gram backoff LMs directly from the 37M-word standardized WSJ training text material, in these experiments all results are reported using the 5k or 20k bigram and trigram backoff LMs provided by Lincoln Labs [19], as required by ARPA, so as to be compatible with the other sites participating in the tests.) A backoff mechanism [10] is used to smooth the estimates of the probabilities of rare n-grams by relying on a lower order n-gram when there is insufficient training data, and to provide a means of modeling unobserved n-grams. Another advantage of the backoff mechanism is that the LM size can be arbitrarily reduced by relying more on the backoff, i.e., by increasing the minimum number of observations required to include an n-gram in the model. This property can be used in the first bigram decoding pass to reduce computational requirements. The trigram language model is used in the second pass of the decoding process. In order to construct LMs for BREF, it was necessary to normalize the text material of the Le Monde newspaper, which entailed a pre-treatment rather different from that used to normalize the WSJ texts [19]. The main differences are in the treatment of compound words, abbreviations, and case. In BREF the distinction between the cases is kept if it designates a distinctive graphemic feature, but not when the upper case is simply due to the fact that the word occurs at the beginning of the sentence. Thus, the first word of each sentence was semi-automatically verified to determine if a transformation to lower case was needed. Special treatment is also needed for the symbols hyphen (-), quote ('), and period (.), which can lead to ambiguous separations. For example, the hyphen in compound words like beaux-arts and au-dessus is considered word-internal.
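The pruning and backoff behaviour described above can be illustrated with a toy bigram model. This is a hedged sketch, not the authors' implementation: it uses absolute discounting and redistributes the freed mass via the raw unigram distribution (a full Katz backoff would renormalize over unseen successors only), and the `cutoff` parameter plays the role of the minimum n-gram observation count mentioned above.

```python
from collections import Counter

def build_backoff_bigram(tokens, cutoff=1, discount=0.5):
    """Toy backoff bigram model.

    Bigrams seen fewer than `cutoff` times are dropped; raising the
    cutoff shrinks the model and pushes more probability mass onto
    the unigram backoff -- the size/accuracy trade-off exploited in
    the first decoding pass.
    """
    unigrams = Counter(tokens)
    total = sum(unigrams.values())
    bigrams = {bg: c for bg, c in Counter(zip(tokens, tokens[1:])).items()
               if c >= cutoff}

    def prob(w1, w2):
        if (w1, w2) in bigrams:
            # discounted relative frequency of the observed bigram
            return (bigrams[(w1, w2)] - discount) / unigrams[w1]
        if w1 not in unigrams:
            return unigrams.get(w2, 0) / total
        # mass freed by discounting, redistributed via the unigram model
        kept = sum(1 for bg in bigrams if bg[0] == w1)
        freed = discount * kept / unigrams[w1]
        return freed * unigrams.get(w2, 0) / total

    return prob
```

Raising `cutoff` here removes bigram entries exactly as described in the text: the model shrinks and more word pairs are scored through the backoff path.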
Alternatively the hyphen may be associated with the first word, as in ex- or anti-, or with the second word, as in -là or -ci. Finally, it may appear in the text even though it is not associated with any word. The quote can have two different separations: it can be word-internal (aujourd'hui, O'Donnel, hors-d'oeuvre), or it may be part of the first word (l'ami). Similarly, the period may be part of a word, for instance, L.A., sec. (secondes), p. (page), or simply an end-of-sentence mark. Table 1 compares some characteristics of the WSJ and Le Monde text corpora. In the same size training texts, there are almost 60% more distinct words for Le Monde than for WSJ without taking case into account. As a consequence, the lexical coverage for a given size lexicon is smaller for Le Monde than for WSJ. For example, the 20k WSJ lexicon accounts for 97.5% of word occurrences, but the 20k BREF lexicon only covers 94.9% of word occurrences in the training texts. For lexicons in the range of 5k to 40k words, the number of words must be doubled for Le Monde in order to obtain the same word coverage as for WSJ. The lexical ambiguity is also higher for French than for English. The homophone rate (the number of words which have a homophone divided by the total number of words) in the 20k BREF lexicon is 57%, compared to 9% in the 20k-open WSJ lexicon. This effect is even greater if the word frequencies are taken into account. Given a perfect phonemic transcription, 23% of the words in the WSJ training texts are ambiguous, whereas 75% of the words in the Le Monde training texts have an ambiguous phonemic transcription. Not only does one phonemic form correspond to different orthographic forms, there can also be a relatively large number of possible pronunciations for a given word.
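The homophone rate quoted above is straightforward to compute from a pronunciation lexicon. The sketch below assumes a simple word-to-phoneme-string mapping; the phoneme strings and toy entries are invented for illustration.

```python
from collections import defaultdict

def homophone_rate(lexicon):
    """Fraction of lexicon words sharing a phonemic form with at
    least one other word (type-based, as for the 20k lexicon rates;
    the text-based rates additionally weight by word frequency)."""
    by_phones = defaultdict(list)
    for word, phones in lexicon.items():
        by_phones[phones].append(word)
    ambiguous = sum(len(ws) for ws in by_phones.values() if len(ws) > 1)
    return ambiguous / len(lexicon)

# Toy French example: vert/verre/vers are homophones, chat is not,
# giving a rate of 3/4.
rate = homophone_rate({"vert": "v E R", "verre": "v E R",
                       "vers": "v E R", "chat": "S a"})
```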
In French, the alternate pronunciations arise mainly from optional word-final phones, due to liaison and optional word-final consonant cluster reduction (see Figure 1). There are also a larger number of frequent monophone words for Le Monde than for WSJ, accounting for about 17% and 3% of all word occurrences in the respective training texts. (If case is kept when distinctive, there are 280k distinct words in the Le Monde training material.)

Corpus                        WSJ      Le Monde
# training speakers           284      80
# training utterances         37.5k    38.5k
Training text size            37.2M    37.7M
# distinct words              165k     259k (280k)
5k coverage                   90.6%    85.5% (85.2%)
20k coverage                  97.5%    94.9% (94.7%)
Homophone rate, 20k lexicon   9%       57%
Homophone rate, 20k text      23%      75%
Monophone words (20k)         3%       17%

Table 1: Comparison of WSJ and BREF corpora. (Values in parentheses are with case kept when distinctive.)

ACOUSTIC-PHONETIC MODELING

The recognizer makes use of continuous density HMM (CDHMM) with Gaussian mixture for acoustic modeling. The main advantage continuous density modeling offers over discrete or semi-continuous (tied-mixture) observation density modeling is that the number of parameters used to model an HMM observation distribution can easily be adapted to the amount of training data associated with each state. As a consequence, high precision modeling can be achieved for highly frequented states without the explicit need of smoothing techniques for the densities of less frequented states. Discrete and semi-continuous modeling use a fixed number of parameters to represent a given observation density and therefore cannot achieve high precision without the use of smoothing techniques. This problem can be alleviated by tying some states of the Markov models in order to have more training data to estimate each state distribution. However, since this kind of tying requires careful design and some a priori assumptions, these techniques are primarily of interest when the training data is limited and cannot easily be increased.
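The Gaussian mixture observation density at the heart of a CDHMM state can be sketched as follows. Diagonal covariances are assumed here as a common simplification; the paper does not detail the covariance structure in this excerpt.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of a frame x under a diagonal-covariance
    Gaussian mixture, i.e. one CDHMM state observation density.

    Uses log-sum-exp for numerical stability, since per-component
    likelihoods underflow quickly for 48-dimensional features.
    """
    log_comps = []
    for w, mu, var in zip(weights, means, variances):
        lc = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            # log of a univariate Gaussian, summed over dimensions
            lc += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_comps.append(lc)
    m = max(log_comps)
    return m + math.log(sum(math.exp(lc - m) for lc in log_comps))
```

Adapting the number of mixture components per state to the available training data is exactly the flexibility the paragraph above credits to continuous density modeling.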
In the experimental section we demonstrate the improvement in performance obtained on the same test data by simply using additional training material. A 48-component feature vector is computed every 10 ms. This feature vector consists of 16 Bark-frequency scale cepstrum coefficients computed on the 8 kHz bandwidth and their first and second order derivatives. For each frame (30 ms window), a 15-channel Bark power spectrum is obtained by applying triangular windows to the DFT output. The cepstrum coefficients are then computed using a cosine transform [2]. The acoustic models are sets of context-dependent (CD), position-independent phone models, which include both intra-word and cross-word contexts. The contexts are automatically selected based on their frequencies in the training data. The models include triphone models, right- and ...
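The 16 cepstrum coefficients plus their first- and second-order derivatives give the 48-component vector described above. A minimal sketch of appending such derivatives follows, using a simple symmetric difference over a ±2 frame window; the exact derivative computation LIMSI used is not specified in this excerpt, so the window choice is an assumption.

```python
def add_derivatives(cepstra, span=2):
    """Append first- and second-order time derivatives to each frame,
    turning 16-dimensional cepstra into 48-dimensional vectors.
    Frame indices are clamped at the utterance edges."""
    n, dim = len(cepstra), len(cepstra[0])

    def diff(seq, i, k):
        # symmetric difference over +/- span frames, clamped at edges
        lo = seq[max(0, i - span)][k]
        hi = seq[min(n - 1, i + span)][k]
        return (hi - lo) / (2 * span)

    deltas = [[diff(cepstra, i, k) for k in range(dim)] for i in range(n)]
    ddeltas = [[diff(deltas, i, k) for k in range(dim)] for i in range(n)]
    return [c + d + dd for c, d, dd in zip(cepstra, deltas, ddeltas)]
```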


Related Papers

The LIMSI continuous speech dictation system: evaluation on the ARPA Wall Street Journal task

In this paper we report progress made at LIMSI in speaker-independent large vocabulary speech dictation using the ARPA Wall Street Journal-based CSR corpus. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. The recognizer uses a time-synchronous graph-search strategy which i...


Speaker-independent continuous speech dictation

In this paper we report progress made at LIMSI in speaker-independent large vocabulary speech dictation using newspaper speech corpora. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and...


Continuous speech dictation in French

A major research activity at LIMSI is multilingual, speaker-independent, large vocabulary speech dictation. In this paper we report on efforts in large vocabulary, speaker-independent continuous speech recognition of French using the BREF corpus. Recognition experiments were carried out with vocabularies containing up to 20k words. The recognizer makes use of continuous density HMM with Gaussia...


Developments in continuous speech dictation using the 1995 ARPA NAB news task

In this paper we report on the LIMSI recognizer evaluated in the ARPA 1995 North American Business (NAB) News benchmark test. In contrast to previous evaluations, the new Hub 3 test aims at improving basic SI, CSR performance on unlimited-vocabulary read speech recorded under more varied acoustical conditions (background environmental noise and unknown microphones). The LIMSI recognizer is an HM...


Spoken Language Processing in the Framework of Human-Machine Communication at LIMSI

The paper provides an overview of the research conducted at LIMSI in the field of speech processing, but also in the related areas of Human-Machine Communication, including Natural Language Processing, Non Verbal and Multimodal Communication. Also presented are the commercial applications of some of the research projects. When applicable, the discussion is placed in the framework of internation...


Developments in Large Vocabulary Dictation: The LIMSI Nov 94 NAB System

In this paper we report on our development work in large vocabulary, American English continuous speech dictation on the ARPA NAB task in preparation for the November 1994 evaluation. We have experimented with (1) alternative analyses for the acoustic front end, (2) the use of an enlarged vocabulary of 65k words so as to reduce the number of errors due to out-of-vocabulary words, (3) extension...



Journal:

Volume   Issue

Pages  -

Publication date: 1994